Language identification for the automatic grapheme-to-phoneme conversion of foreign words in a German text-to-speech system

Author

  • Peter Henrich
Abstract

The German language is interspersed with words, mostly foreign words, which cannot be converted correctly by German grapheme-to-phoneme rules. Almost all of these words belong to the English or French language. In order to convert the whole text correctly, it is necessary to identify the language of each word and then to use the corresponding set of grapheme-to-phoneme rules. The problem of language identification is centrally important for almost every application of automatic text output when its source is a multilingual text database.

FURTHER APPLICATIONS

The possibility of detecting the language of a written text automatically opens up some central fields of application. The following two may serve as examples. A library of scientific literature contains, e.g., texts written in English, German and French. If a blind person wants to access the library via a text-to-speech system, this system can detect the language, activate the corresponding rule set, and then read the selected text aloud correctly. A list of available articles on a selected topic may serve as a second example: such a list may be available only on microfiche at the moment, but it could be made callable by phone if the language of the title and the author's name can be detected automatically.

BASIC SYSTEM

For all these applications the same basic architecture can be used (Figure 1). Supposing that grapheme-to-phoneme rule sets with an accuracy of 100% are available for the three languages mentioned above, the performance of a multilingual system depends only on the quality of the language identifier. The main problem is how to detect the language automatically. The only information that can be used by a language identification system is the character code of the written text, the word length and possibly the domain of a word within a sentence.

Fig. 1: Multilingual Grapheme-to-Phoneme Conversion (ASCII text in languages L.1, L.2, ..., L.n; language identifier; grapheme-to-phoneme converters)

Basically, two methods of evaluating this information in order to identify the language are feasible:

1. Systems based on statistics calculate the mean probability of all grapheme transitions of a word for the possible languages and select the most probable one.
2. Rule-based systems evaluate multiple features in order to exclude, step by step, those languages a word definitely does not belong to.

The necessary information for this evaluation can be found either on word or sentence level.

USED CORPORA

For the development and the evaluation of all language identification systems discussed in the following, the same text corpora have been used. These corpora were collected in the ESPRIT Project 291/860. They consist of office communication texts, 200,000 words each in English, French and German. From these corpora word lists have been extracted and subsequently freed from foreign words, words containing language-specific characters, and abbreviations.
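
The statistical method named in the abstract (mean probability of all grapheme transitions of a word, evaluated per language) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the use of conditional letter-bigram probabilities as the "transition probability", the `#` word-boundary marker, and the tiny word lists, which merely stand in for the cleaned corpus word lists.

```python
from collections import Counter

# Hypothetical toy word lists; the paper uses word lists extracted from
# large corpora, which are not reproduced here.
TOY_WORDS = {
    "german": ["schule", "zeitung", "strasse", "wissenschaft", "sprache", "buch"],
    "english": ["the", "with", "through", "language", "speech", "which"],
    "french": ["bureau", "nouveau", "beaucoup", "langue", "quelque", "eau"],
}


def transitions(word):
    """Return the grapheme transitions of a word as (current, next) pairs.

    '#' marks the word boundary (an assumption of this sketch).
    """
    padded = "#" + word.lower() + "#"
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


def train(words):
    """Count transitions and their left-hand graphemes for one language."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        for current, nxt in transitions(word):
            pair_counts[(current, nxt)] += 1
            first_counts[current] += 1
    return pair_counts, first_counts


def mean_transition_probability(word, model):
    """Arithmetic mean of P(next | current) over all transitions of a word.

    Transitions never seen in training contribute a probability of 0.
    """
    pair_counts, first_counts = model
    probs = []
    for current, nxt in transitions(word):
        seen = first_counts[current]
        probs.append(pair_counts[(current, nxt)] / seen if seen else 0.0)
    return sum(probs) / len(probs)


def identify_language(word, models):
    """Select the language whose model gives the highest mean probability."""
    return max(models, key=lambda lang: mean_transition_probability(word, models[lang]))


MODELS = {lang: train(words) for lang, words in TOY_WORDS.items()}

if __name__ == "__main__":
    print(identify_language("zeitung", MODELS))  # german, for these toy lists
```

In the architecture of Fig. 1, the label returned by such an identifier would select which grapheme-to-phoneme rule set is applied to the word. With word lists this small the scores are only meaningful for words close to the training data; realistic use would need the full corpus word lists.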

Similar resources

Dialect variation in Boro Language and Grapheme-to-Phoneme conversion rules to handle lexical lookup fails in Boro TTS System

It is not possible to include all the words of a natural language in a general text-to-speech system. A grapheme-to-phoneme conversion system is essential for pronouncing a word which is out of vocabulary. Grapheme-to-phoneme rules play a vital role where lexical lookup fails. Though a basic grapheme-to-phoneme rule system is very simple, it is very powerful for the naturalness of a TTS system. Letter-to...

Statistical Grapheme to Phoneme Conversion using Language Origin

This report describes a method for grapheme to phoneme conversion using statistical models of pronunciation. The available techniques for this conversion are first described and examples of each are given. A baseline system which uses Hidden Markov Models to represent phonemes in English is described and evaluated. The results from the baseline system serve to replicate previous research and to...

Design Issues in Automatic Grapheme-to-Phoneme Conversion for Standard Yorùbá

Grapheme-to-Phoneme (G2P) conversion is an important problem in Human Language Processing development, particularly Text-to-Speech (TTS). Its primary goal is to accurately compute the pronunciation of words in the input texts. This work examines design issues with respect to components of the automatic G2P for standard Yorùbá (SY). The automatic process includes: (i) Tokenisation of Input, (ii) ...

A Korean grapheme-to-phoneme conversion system using selection procedure for exceptions

Cultural, social, economic and other various environmental factors affect our language, and different words and terminology are used and coined for different contexts, which triggers quantitative change of vocabulary of a language. This paper presents a Korean grapheme-to-phoneme conversion system using a selection procedure for exceptions from added text corpus, which reflects such dynamic nat...

Text-to-speech with cross-lingual neural network-based grapheme-to-phoneme models

Modern Text-To-Speech (TTS) systems need to increasingly deal with multilingual input. Navigation, social and news are all domains with a large proportion of foreign words. However, when typical monolingual TTS voices are used, the synthesis quality on such input is markedly lower. This is because traditional TTS derives pronunciations from a lexicon or a Grapheme-To-Phoneme (G2P) model which w...

Publication date: 1989